Maximizing Performance: Strategies for Code Optimization

Valeria Duran

What is Optimization?

An act, process, or methodology of making something (such as a design, system, or decision) as fully perfect, functional, or effective as possible. – Merriam-Webster

Code optimization is the process of enhancing code quality and efficiency.

Pre-Optimization Steps

  • Make the code work.
  • Don’t repeat yourself.
  • Write clean code and document it well.
  • Don’t try to reinvent the wheel.

“Make it work, then make it beautiful, then if you really, really have to, make it fast. 90 percent of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!” –Joe Armstrong

Optimization should be the final step of your programming practice. Code that follows the pre-optimization steps often performs well enough already. Only optimize when it’s necessary!

Steps to Optimize Code

  1. Make sure code runs as expected, is clean, is well documented, and is created with the end user in mind.
  2. Profile code (identify where the slow code is)
  3. Find other solutions
    1. Vectorize code when possible
    2. Use parallelization techniques
    3. Cache frequently used data
    4. Manage memory
    5. Find a faster function/package
  4. Benchmark code (compare the original code with your new solution)
  5. Execute!
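
As a rough illustration of steps 3 and 4, the sketch below compares a loop with a vectorized version of the same calculation and times both with base R. The task (a sum of squares) and the function names are hypothetical, chosen only to keep the example small; timings will differ from machine to machine.

```r
# Hypothetical task: sum of squares of a numeric vector.
set.seed(42)
x <- runif(1e6)

# Loop version: one R-level iteration per element.
sum_squares_loop <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value^2
  }
  total
}

# Vectorized version: the iteration happens in compiled code.
sum_squares_vec <- function(v) {
  sum(v^2)
}

# Check that both versions agree before comparing speed.
stopifnot(isTRUE(all.equal(sum_squares_loop(x), sum_squares_vec(x))))

# Rough benchmark with base R; a package such as microbenchmark
# gives more reliable numbers by repeating each expression many times.
loop_time <- system.time(sum_squares_loop(x))[["elapsed"]]
vec_time  <- system.time(sum_squares_vec(x))[["elapsed"]]
cat("loop:", loop_time, "s; vectorized:", vec_time, "s\n")
```

Checking that the two versions return the same result before timing them keeps the benchmark honest: a faster function that gives a different answer is not an optimization.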

Scenario 1: Working with Big Data!

Goal: build a model using the insurance claims of ~25 million lives with two years’ worth of data (i.e., billions of records!) to estimate the cost of a procedure.

Steps:

  1. Partition data into categories/sections (body region).
  2. Create R scripts with tidyverse on a small dataset (Houston, Texas population for 1 year).
  3. Apply same scripts to a larger dataset (Texas population for 1 year).
  4. Update scripts to use data.table instead of tidyverse.
  5. Upgrade the Windows instance (increasing RAM from 16 GB to 32 GB).
  6. Run script on body regions (nationwide, and on two years of data).

Took months to complete…

…Other solutions are also possible!
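
One way to picture the partitioning idea from step 1 is to summarise each body region separately and combine only the small summaries. The sketch below is a base R illustration on made-up data, not the scripts used in the project: the column names, the three regions, and the helper summarise_region() are all assumptions.

```r
# Hypothetical claims data: column names and values are made up for illustration.
set.seed(1)
claims <- data.frame(
  body_region = sample(c("knee", "shoulder", "spine"), 1000, replace = TRUE),
  cost        = round(runif(1000, min = 100, max = 5000), 2),
  stringsAsFactors = FALSE
)

# Summarise one partition; only this slice needs to be in memory at a time.
summarise_region <- function(region_data) {
  data.frame(
    body_region = region_data$body_region[1],
    n_claims    = nrow(region_data),
    mean_cost   = mean(region_data$cost),
    stringsAsFactors = FALSE
  )
}

# In a real pipeline each partition would be read from disk or a database
# inside the loop; here split() stands in for that step.
partitions <- split(claims, claims$body_region)
region_summary <- do.call(rbind, lapply(partitions, summarise_region))
rownames(region_summary) <- NULL
print(region_summary)
```

Because each partition is reduced to a one-row summary before the results are combined, peak memory depends on the largest partition rather than on the full dataset.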

Scenario 2: End User in Mind

# RStudio Server: 420 ms

# RStudio Desktop below

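library(profvis)
library(git2r)

profvis({
  # Return the names of .rda objects in the repository that match data_object.
  git_object <- function(data_object = NULL) {
    # "\\.rda$" matches a literal ".rda" extension; in ".rda" the dot is a wildcard.
    object_names <- sort(unique(subset(odb_blobs(), grepl("\\.rda$", name))$name))

    git_obj <- grep(data_object, object_names, value = TRUE, ignore.case = TRUE)

    return(git_obj)
  }

  git_object("pkg")
})

library(microbenchmark)

# Compare git2r's odb_blobs() with calling git directly through system2().
# Naming the expressions labels the output and avoids overwriting git_object().
microbenchmark(
  git2r   = sort(unique(subset(odb_blobs(), grepl("\\.rda$", name))$name)),
  system2 = sort(unique(gsub(".*/", "", system2("git", "ls-files *.rda", stdout = TRUE)))),
  times = 10
)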

… using base R’s system2() is much faster than git2r’s odb_blobs(). The newer thing out there isn’t always the best option!

Useful R Tools

  • Profiling packages: profvis and profile.
  • Benchmarking packages: microbenchmark and bench.
  • Caching packages: memoise and mustashe.
  • Parallel computing packages: snow and parallel.
  • On-disk packages: ff, bigmemory, and feather.
  • Use gc() (garbage collect) to release memory (not required, but doesn’t hurt to use after removing large objects).
  • Use package data.table for faster computations.
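
To make the caching idea concrete, here is a minimal hand-rolled sketch in base R. It is not how memoise is implemented; make_cached() and slow_square() are hypothetical names, and the Sys.sleep() call only stands in for an expensive computation. In practice a caching package handles multiple arguments and cache invalidation for you.

```r
# Wrap a function so repeated calls with the same argument reuse the stored result.
# Hand-rolled for illustration; it only supports a single scalar argument.
make_cached <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, fn(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Hypothetical slow function: counts its calls so the cache effect is visible.
call_count <- 0
slow_square <- function(x) {
  call_count <<- call_count + 1
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

cached_square <- make_cached(slow_square)
first  <- cached_square(4)  # computed by slow_square()
second <- cached_square(4)  # served from the cache
cat("results:", first, second, "| underlying calls:", call_count, "\n")
```

The second call returns without touching slow_square(), which is the whole benefit: the cost is paid once per distinct input, at the price of holding the results in memory.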

Can Optimizing Code Be a Bad Thing?

“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.”
– Donald Knuth, Computer Programming as an Art

When is Optimizing Code Bad?

  • When code becomes less readable.
  • When performance improvement is minuscule.
  • When the time needed to optimize is longer than the task at hand.
  • When code is not used frequently enough.

The biggest trade-off: TIME!

Conclusion

  • When writing code, consider the end user. What the majority will use might not be what you use. This will save you a lot of future rework.
  • Trustworthy code is oftentimes better than fast code.
  • Don’t optimize unless you absolutely must.
  • Consider the biggest trade-off: time. Is it worth it?

Resources